Welcome to the third part of our 16SrRNA Amplicon Sequencing. Here, we will take a look into the diversity of ASVs within each of the samples, and squeeze this dataset to extract as much information as possible from it.

The way we compare microbial communities is using tools from community ecology, and we have heaps of theoretical and practical knowledge about them because they were originally developed for ecology of macroorganisms. With R, calculating the metrics is much easier than understanding what they mean, so pay attention to what each metric means when you’re interpreting the results!

Please take your time to read all instructions carefully, prepare your scripts and understand what you want to achieve with each command.

1 Setting up your environment

Before we begin, set the paths in your local environment on the server by replacing these paths with the ones that point to the directory where you want to store the results from today’s tutorial, and to the phyloseq object you saved in the previous part 2. We will also need to reload the Metadata table, which has been cleaned (Metadata_kefir_modified.csv).

#set path in plain
PathToWD <- (".")
knitr::opts_knit$set(root.dir = normalizePath(PathToWD)) 
PathToWD <- ("/Users/admin/Documents/UNIL/Enseignement/SAGE/SAGE2")
path2ps <- paste(PathToWD, "04_Phyloseq_object/", sep="/")
setwd(path2ps)
ps <- readRDS("PhyloSeq_Object.rds")
metaKefir <- read.table(file="Metadata_kefir_modified.csv",sep=",",header=TRUE)
rownames(metaKefir) <- metaKefir$id_sample

path2dv <- paste(PathToWD, "05_Diversity", sep="/")
dir.create(path2dv)
setwd(path2dv)

1.1 Loading phyloseq

We will use the phyloseq package, which was specifically developed for the analysis of microbial communities. We will need also some extensions and other packages.

library(phyloseq)
library(phyloseq.extended)
library(ggplot2)
library(ggpubr)
library(genefilter)
library(vegan)
library(ampvis)

2 The phyloseq object

To begin, you will need the phyloseq object you should have constructed at the end of the Part 2. You’ve already loaded it above, but here is the code again in case you need to reload a fresh one:

setwd(path2ps)
ps <- readRDS("PhyloSeq_Object.rds")
sample_data(ps) <- metaKefir
path2dv <- paste(PathToWD, "05_Diversity", sep="/")
setwd(path2dv)

# Your phyloseq object is now in *ps*
sample_data(ps) <- metaKefir
ps
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 3943 taxa and 35 samples ]
## sample_data() Sample Data:       [ 35 samples by 12 sample variables ]
## tax_table()   Taxonomy Table:    [ 3943 taxa by 7 taxonomic ranks ]
## refseq()      DNAStringSet:      [ 3943 reference sequences ]

This is a special R object that must contain the following:

  • phyloseq object
    • otu_table : This is an R matrix contains your ASV x Sample matrix, where the columns are your samples, rows are each of the ASVs you identified with DADA2, and the cells are populated with the abundance (in number of reads) of each ASV in each sample.
    • sample_data : This is an R dataframe that contains all the metadata about your samples (sampleID, treatment, DNA yield, name of sampler, etc.). Rows contain each one of your samples, and columns each of the variables in the metadata.
    • taxonomyTable : This is an R matrix that contains the taxonomic assignments for each of the ASVs. ASVs exist in the rows, and the classification is in columns, on different ranks.
    • DNAStringSet : This is a special Biostrings object that contains the sequence of each ASV, its ASV identifier and its length.

2.1 Accessing your data

Since the phyloseq object is a special kind of object, each data object will be accesible differently:

otu_table(ps)
sample_data(ps)
tax_table(ps)

phyloseq provides a set of functions to quickly retrieve useful information from your phyloseq object… they are self-explicative, so try them and figure out what they do:

library(phyloseq)
ps
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 3943 taxa and 35 samples ]
## sample_data() Sample Data:       [ 35 samples by 12 sample variables ]
## tax_table()   Taxonomy Table:    [ 3943 taxa by 7 taxonomic ranks ]
## refseq()      DNAStringSet:      [ 3943 reference sequences ]
rank_names(ps)
## [1] "Kingdom" "Phylum"  "Class"   "Order"   "Family"  "Genus"   "Species"
sample_variables(ps)
##  [1] "id_sample"      "treatment"      "DNAconc"        "inoculum"      
##  [5] "fruit"          "time"           "label"          "NovoID"        
##  [9] "student_Number" "Family_Name"    "First_name"     "AllFactor"
get_variable(ps) 
ntaxa(ps)
## [1] 3943
nsamples(ps)
## [1] 35
sample_names(ps)
##  [1] "K1"  "K10" "K11" "K12" "K13" "K14" "K15" "K16" "K17" "K18" "K19" "K2" 
## [13] "K20" "K21" "K22" "K23" "K24" "K25" "K26" "K27" "K28" "K29" "K3"  "K30"
## [25] "K31" "K32" "K33" "K34" "K35" "K4"  "K5"  "K6"  "K7"  "K8"  "K9"
taxa_names(ps)[1:20] # Just the first 20
##  [1] "ASV1"  "ASV2"  "ASV3"  "ASV4"  "ASV5"  "ASV6"  "ASV7"  "ASV8"  "ASV9" 
## [10] "ASV10" "ASV11" "ASV12" "ASV13" "ASV14" "ASV15" "ASV16" "ASV17" "ASV18"
## [19] "ASV19" "ASV20"
sample_sums(ps)
##     K1    K10    K11    K12    K13    K14    K15    K16    K17    K18    K19 
## 101652  69731  96825  92202  87283  68179  64264  95494  94088  66573  78783 
##     K2    K20    K21    K22    K23    K24    K25    K26    K27    K28    K29 
##  88323  96428  82532  85009  81102  76271  98489  96448 100423 100134  72124 
##     K3    K30    K31    K32    K33    K34    K35     K4     K5     K6     K7 
##  85544  83242  76512  81086  78832  89789  68099  89441  93646  63712  96747 
##     K8     K9 
## 101635  92416
taxa_sums(ps)[1:20] # Just the first 20
##    ASV1    ASV2    ASV3    ASV4    ASV5    ASV6    ASV7    ASV8    ASV9   ASV10 
## 1402056  668284  169362  122123   91989   85013   51502   32011   26215   25768 
##   ASV11   ASV12   ASV13   ASV14   ASV15   ASV16   ASV17   ASV18   ASV19   ASV20 
##   24726   16148   14744   14068   12061    8876    7889    6810    6072    4764
get_taxa(ps,"K11")[1:20]      # Just the furst 20 ASV abundances for ONE sample
##  ASV1  ASV2  ASV3  ASV4  ASV5  ASV6  ASV7  ASV8  ASV9 ASV10 ASV11 ASV12 ASV13 
## 21946 45282  6003  3736  2367  7282   239  2134   384     0   410   220   409 
## ASV14 ASV15 ASV16 ASV17 ASV18 ASV19 ASV20 
##    31   165   316   537    26    19   121
get_sample(ps,"ASV1")   # Abundances in ALL samples for ONE OTU. 
##    K1   K10   K11   K12   K13   K14   K15   K16   K17   K18   K19    K2   K20 
## 46697 23264 21946 22921 28854 31401 31853 72118 50449 32366 28492 41991 58425 
##   K21   K22   K23   K24   K25   K26   K27   K28   K29    K3   K30   K31   K32 
## 38411 37748 10634 22326 55201 54925 62010 19252 38184 53705 51182 51895 59333 
##   K33   K34   K35    K4    K5    K6    K7    K8    K9 
## 46004 33661 37910 29042 57836 43712 47787 44992 15529

You can manipulate these as common R objects:

class( sample_sums(ps) )
## [1] "numeric"
str( sample_sums(ps) )
##  Named num [1:35] 101652 69731 96825 92202 87283 ...
##  - attr(*, "names")= chr [1:35] "K1" "K10" "K11" "K12" ...
head(sort(sample_sums(ps),decreasing = TRUE),35)
##     K1     K8    K27    K28    K25    K11     K7    K26    K20    K16    K17 
## 101652 101635 100423 100134  98489  96825  96747  96448  96428  95494  94088 
##     K5     K9    K12    K34     K4     K2    K13     K3    K22    K30    K21 
##  93646  92416  92202  89789  89441  88323  87283  85544  85009  83242  82532 
##    K23    K32    K33    K19    K31    K24    K29    K10    K14    K35    K18 
##  81102  81086  78832  78783  76512  76271  72124  69731  68179  68099  66573 
##    K15     K6 
##  64264  63712
sample_sums(ps)["K11"]
##   K11 
## 96825
taxa_sums(ps)[1:15]
##    ASV1    ASV2    ASV3    ASV4    ASV5    ASV6    ASV7    ASV8    ASV9   ASV10 
## 1402056  668284  169362  122123   91989   85013   51502   32011   26215   25768 
##   ASV11   ASV12   ASV13   ASV14   ASV15 
##   24726   16148   14744   14068   12061
taxa_sums(ps)[c("ASV1","ASV10","ASV100")]
##    ASV1   ASV10  ASV100 
## 1402056   25768     420
sample_data(ps)$AllFactor <- rep("ALL",35)

We now have everything we need to start exploring our dataset!

3 Exploring your dataset

3.1 Subsetting the dataset

We will first need to get rid of one sample that corresponds to Vladimir’s kefir, from which Philipp’s and Vincent’s kefir originate. Vladimir’s kefir composition will not be studied in this tutorial.

ps = subset_samples(ps, First_name != "Vladimir")
ps
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 3943 taxa and 34 samples ]
## sample_data() Sample Data:       [ 34 samples by 12 sample variables ]
## tax_table()   Taxonomy Table:    [ 3943 taxa by 7 taxonomic ranks ]
## refseq()      DNAStringSet:      [ 3943 reference sequences ]

3.2 Filtering and decontamination

We also need to filter a few taxa that do not correspond to bacterial ASVs, like chloroplasts and mitochondria.

A PCR reaction can effectively amplify templates even more diluted than 1 copy per 50 uL. Unfortunately, this means that it can also amplify undesired DNA from contaminant genomic DNA present in the sample, such as host’s mitochondria and chloroplasts, bacteria in transit in the environment or human contamination due to sample handling. The effect will be even larger in samples with very low biomass of the desired community.

There are many ways to control for contaminants, but none is standardized. Monitoring total DNA extracted before preparing the libraries is a good way to detect low-yield samples. Running blank or ghost DNA extractions (without the desired tissue or sample) are good to detect contaminants in reagents.

First of all, we will search for sequences assigned to mitochondria from the yeasts or chloroplasts from the figs and raisins.

rank_names(ps)
## [1] "Kingdom" "Phylum"  "Class"   "Order"   "Family"  "Genus"   "Species"
# Check if there are any ASVs classified as eukaryotic or NA
get_taxa_unique(ps,"Kingdom")
## [1] "Bacteria" NA         "Archaea"
# Remove ASVs assigned to chloroplasts or mitochondria
ps <- subset_taxa(ps, Kingdom != "Archaea" | is.na(Kingdom))
ps <- subset_taxa(ps, Order != "Chloroplast" | is.na(Order))
ps <- subset_taxa(ps, Family != "Mitochrondria" | is.na(Family))

Your object ps should not contain any ASVS classified as chloroplasts or mitochondria.

3.3 Rarefaction

Differences in sampling effort (sequening depth) can strongly affect diversity analyses, because samples that appear to have the same diversity might actually be sampled at different depths.

One technique to evaluate and compare the sampling depth of a community is rarefaction. Rarefaction will subsample each sample at different sequencing depths, and estimate how many ASVs you would have if you had sampled at that specific depth. A cumulative rarefaction curve then is plotted by joining the expectations at different depths. As such, it is also a measure of estimated richness.

ps
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 3882 taxa and 34 samples ]
## sample_data() Sample Data:       [ 34 samples by 12 sample variables ]
## tax_table()   Taxonomy Table:    [ 3882 taxa by 7 taxonomic ranks ]
## refseq()      DNAStringSet:      [ 3882 reference sequences ]
pl.rare.all <- ggrare(ps, step = 1000, label = "Sample", se = FALSE) + xlab("Sequencing Depth") + ylab("ASV Richness")
## rarefying sample K1
## rarefying sample K10
## rarefying sample K11
## rarefying sample K12
## rarefying sample K13
## rarefying sample K14
## rarefying sample K15
## rarefying sample K16
## rarefying sample K17
## rarefying sample K18
## rarefying sample K19
## rarefying sample K2
## rarefying sample K20
## rarefying sample K21
## rarefying sample K22
## rarefying sample K23
## rarefying sample K24
## rarefying sample K25
## rarefying sample K26
## rarefying sample K27
## rarefying sample K28
## rarefying sample K29
## rarefying sample K3
## rarefying sample K30
## rarefying sample K31
## rarefying sample K32
## rarefying sample K33
## rarefying sample K34
## rarefying sample K4
## rarefying sample K5
## rarefying sample K6
## rarefying sample K7
## rarefying sample K8
## rarefying sample K9

What you need to pay attention to:

  • The height of the curve at its final point is the observed richness in your sample (all other points are below because it was subsampled from this!)
  • The slope of the curve at the final point, which tells you how many more ASVs are you discovering each time you sample 1000 reads more. A curve at 45° between the last two sampling points is growing at a 1:1 ratio, and would mean that each new read you sample is a new ASV you hadn’t seen before. A completely flat curve at 0° would mean no new ASVs were found, and that all 1000 reads sampled at the final point belong to ASVs you had already found before.
  • The minimum sequencing depth at which all samples are contained.

What have you learnt from the rarefaction curve?

  • How well-sampled are our samples? Would you recommend more sequencing? Or would you recommend less sequencing depth next time?
  • How heterogeneous is the sequencing depth?
  • How fast are the curves flattening? What does this mean?
  • Can we use these samples like this? How could it possibly affect our analyses?

Here, because the samples are not too heterogeneous in their sequencing depths, and because the smaller samples (K6 and K15) have a sequencing depth at which all other samples have reached their plateau, there is no need to rarefy, we would just lose information.

3.4 Prevalence

Another way to infer how many sampled bacteria were in transit (randomly sampled from the environment), is to count in how many samples each ASV is found with an abundance larger than zero. We call this the prevalence of an ASV.

In brief, those bacteria that are central components essential for your ecosystem will be present in all samples, and in relatively large proportions. ASVs from accidentally sampled bacteria, contaminants or artifacts will have a prevalence around 1 and generally in low proportions.

# Calculate prevalence across all samples
ps.prev <- estimate_prevalence(ps,group="AllFactor",rarefy = FALSE)
ps.prev$samplePrev <- ps.prev$prevalence * 35 # Convert to absolute sample numbers
ps.prev$relAbund <- ( ps.prev$abundance / sum(ps.prev$abundance) ) # Calculate relative abundance 

#Plot
pl.prev.his <- ggplot(ps.prev, aes(x=samplePrev))+
  geom_histogram(colour="steelblue",fill="steelblue", binwidth = 1, boundary = -0.5) +
  scale_x_continuous(breaks = 1:35) +
  xlab("Number of samples in which ASVs are found") +
  ylab("Number of ASVs") +
  stat_bin(binwidth=1, geom='text', colour="black", size = 2, aes(label=..count..), position=position_stack(vjust = 1.1))
pl.prev.his

## Compare prevalence with relative abundance
# select the "Phylum" taxonomic assignation for your ASVs
ps.tax.class <- data.frame(tax_table(ps)[,2]) 
# Plot
pl.prev.sct <- ggplot(ps.prev, aes(x=samplePrev, y=relAbund, color=ps.tax.class$Phylum)) +
  geom_point() +
  theme(legend.title=element_blank(),legend.text = element_text(size = 8),
        legend.position="bottom") +
  xlab("Number of samples in which ASVs are found") +
  ylab("Relative abundance")
pl.prev.sct

  • From these plots we can learn the following:
    • Does the majority of the ASVs have high or low prevalence?
    • Are there any ASVs occurring in all samples? Were we expecting it?
    • Are the most abundant ASVs dominating a single or few samples or are they present in all samples?

3.5 Richness

Most of what we’ve done so far depends on the number of ASVs we have per sample or richness. The richness of a sample can be affected by sequencing effort, DNA yields, volume sampled and many other factors. Let’s calculate ASV richness and compare it to the total number of reads per sample.

# Calculate how many total ASVs 
tmp_df <- t(otu_table(ps))
ps.asv <- data.frame(colSums(tmp_df!=0))
colnames(ps.asv) <- "ASVs"
rm(tmp_df)

# Now add the total read number
ps.asv$reads <- sample_sums(ps)
colnames(ps.asv) <- c("ASVs", "reads")
ps.asv$sampleID <- rownames(ps.asv)

# Make barplots for each sample:
pl.asv.bar <- ggplot(ps.asv, aes(x = sampleID, y = ASVs) ) + 
  ggtitle("Total ASVs per sample") + 
  ylab("ASV number") +
  geom_bar(stat="identity", fill="steelblue")+
  theme(axis.text.x = element_text(angle=90))

pl.red.bar <- ggplot(ps.asv, aes(x = sampleID, y = reads) ) + 
  ggtitle("Total reads per sample") + 
  ylab("Read number") +
  geom_bar(stat="identity", fill="indianred")+
  theme(axis.text.x = element_text(angle=90))

# Histogram of ASVs and read counts
pl.asv.his <- ggplot(ps.asv, aes(x = reads)) + 
  geom_histogram(color = "black", fill = "indianred", binwidth = 5000) +
  ggtitle("Total reads distribution") + 
  xlab("Reads") +
  ylab("Samples")+
  theme(axis.text.x = element_text(angle=90))

pl.red.his <- ggplot(ps.asv, aes(x = ASVs)) + 
  geom_histogram(color = "black", fill = "steelblue", binwidth = 50) +
  ggtitle("Total ASV distribution") + 
  xlab("ASVs") +
  ylab("Samples")+
  theme(axis.text.x = element_text(angle=90))

# Visualize
ggarrange(pl.red.bar, pl.asv.bar, pl.asv.his, pl.red.his, ncol=2, nrow=2)

summary(ps.asv)
##       ASVs           reads          sampleID        
##  Min.   : 35.0   Min.   : 63418   Length:34         
##  1st Qu.: 78.0   1st Qu.: 78697   Class :character  
##  Median :156.5   Median : 86350   Mode  :character  
##  Mean   :185.6   Mean   : 85758                     
##  3rd Qu.:251.8   3rd Qu.: 95910                     
##  Max.   :498.0   Max.   :101518
  • What can you learn from these plots?
    • How variable is the total ASV richness across samples?

It looks like we might be seeing more ASVs in samples that were sequenced more deeply (yielded more reads). Let’s see if we can have a better picture of this:

# Compare ASVs against Reads
summary(lm(ps.asv$ASVs ~ ps.asv$reads))
## 
## Call:
## lm(formula = ps.asv$ASVs ~ ps.asv$reads)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -155.91  -86.74  -40.08   77.89  301.23 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -65.713512 161.642057  -0.407    0.687
## ps.asv$reads   0.002931   0.001869   1.568    0.127
## 
## Residual standard error: 122.4 on 32 degrees of freedom
## Multiple R-squared:  0.07136,    Adjusted R-squared:  0.04234 
## F-statistic: 2.459 on 1 and 32 DF,  p-value: 0.1267
pl.asv_red.scp <- ggplot(ps.asv, aes(x=reads, y=ASVs)) + 
  geom_point()+
  geom_smooth(method=lm)

## Compare ASVs against DNA yields
# Add DNA yields from the sample metadata
tmp1 <- sample_data(ps)[,"DNAconc"]
ps.asv2 <- merge(ps.asv, tmp1, by="row.names", all=TRUE)  
rownames(ps.asv2) <- ps.asv2$sampleID
ps.asv2[,"Row.names"] <- NULL

# Compare ASVs against DNA yield
summary(lm(ps.asv2$ASVs ~ ps.asv2$DNAconc))
## 
## Call:
## lm(formula = ps.asv2$ASVs ~ ps.asv2$DNAconc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -150.35 -110.41  -25.80   66.65  312.55 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     177.1807    38.6265   4.587 6.57e-05 ***
## ps.asv2$DNAconc   0.1133     0.4285   0.264    0.793    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 126.8 on 32 degrees of freedom
## Multiple R-squared:  0.002179,   Adjusted R-squared:  -0.029 
## F-statistic: 0.06987 on 1 and 32 DF,  p-value: 0.7932
pl.asv_dna.scp <- ggplot(ps.asv2, aes(x=DNAconc, y=ASVs)) + 
  geom_point()+
  geom_smooth(method=lm)
rm(tmp1, tmp2)

# Visualize
ggarrange(pl.asv_dna.scp, pl.asv_red.scp, ncol=2, nrow=1)

  • Now we can decide:
    • Is the number of ASVs biased by the DNA extraction yields? Is it statistically significant?
    • Is the number of ASVs biased by the sequencing depth? Is it statistically significant?

3.6 Composition - Taxon-level analysis

Now let’s see what is the composition of each sample and how similar it is to what has been previously published.

## First lets count how many Phyla are in each sample
get_taxa_unique(ps,taxonomic.rank = "Phylum")
##  [1] "Firmicutes"                    "Actinobacteriota"             
##  [3] "Proteobacteria"                "Desulfobacterota"             
##  [5] "Bacteroidota"                  "Verrucomicrobiota"            
##  [7] "Cyanobacteria"                 "Campilobacterota"             
##  [9] "Deferribacterota"              NA                             
## [11] "Marinimicrobia_(SAR406_clade)" "SAR324_clade(Marine_group_B)" 
## [13] "Planctomycetota"               "Acidobacteriota"              
## [15] "Chloroflexi"                   "RCP2-54"                      
## [17] "Dadabacteria"                  "Fibrobacterota"               
## [19] "Abditibacteriota"              "NB1-j"                        
## [21] "Nitrospirota"                  "Gemmatimonadota"              
## [23] "Myxococcota"                   "Bdellovibrionota"             
## [25] "Armatimonadota"                "Fusobacteriota"               
## [27] "Spirochaetota"                 "Entotheonellaeota"            
## [29] "Synergistota"                  "WPS-2"                        
## [31] "Elusimicrobiota"               "FCPU426"                      
## [33] "Nitrospinota"                  "Methylomirabilota"            
## [35] "PAUC34f"                       "Latescibacterota"             
## [37] "Patescibacteria"               "Deinococcota"                 
## [39] "Deferrisomatota"               "Calditrichota"                
## [41] "Acetothermia"                  "Dependentiae"                 
## [43] "Sva0485"
# Let''s look at the distribution of phyla in each sample
ps.phy <- tax_glom(ps, taxrank="Phylum")
ps.phy.df <- psmelt(ps.phy)
# Plot each treatment group next to each other
ps.phy.df %>%
  ggplot(aes_string(x = "Sample", y = "Abundance", fill = "Phylum")) +
  geom_bar(stat = "identity", position = "stack") + 
  theme_bw() + 
  theme(legend.position = "bottom", axis.text.x =  element_text(angle = 80, vjust = 0.5)) +
  ggtitle("Relative abundance at the phylum level")

# Now let's do the same thing at the genus level, displaying only the 20 most abundant genera
# select top 10 most prevalent Families over the whole dataset
ps.gen <- tax_glom(ps, taxrank="Genus")
Top10_gen <- names(sort(taxa_sums(ps.gen), TRUE)[1:10])
ps.gen <- prune_taxa(Top10_gen, ps.gen)
# Melt them for plotting
ps.gen.df <- psmelt(ps.gen)

# Plot each treatment group next to each other
ps.gen.df %>%
  ggplot(aes_string(x = "Sample", y = "Abundance", fill = "Genus")) +
  geom_bar(stat = "identity", position = "stack") + 
  theme_bw() + 
  theme(legend.position = "bottom", axis.text.x =  element_text(angle = 80, vjust = 0.5)) +
  ggtitle("Relative abundance at the Genus level")

  • In this barplot, each bar corresponds to one sample, and its height corresponds to the total number of reads in it. Each box inside the bars represent one phylum/genus (or any other chosen taxonomic rank) in that sample, and their size is proportional to the abundance of that taxon in that sample. The taxon boxes are coloured by their assigned taxon (but more on this later).
    • Are there any dominant taxon?
    • Is this similar to what has been reported in the literature?

3.7 Filtering your dataset

By now you probably realized that there are lots of low-prevalence and low-abundance (rare) ASVs in our samples, that are either contaminats or simply noise in our dataset. Thus we will now filter ASVs that are too rare and in very few samples, and retain those with larger abundance and prevalence.

To counter the differences in sequencing depth, we will transform our data with absolute read counts to proportions, where the sum of all reads per sample equals 1. Then we will retain only ASVs whose relative abundance is larger than 1% (0.01) in at least 1 sample. In this way, we can keep ASVs that are relatively abundant in a few samples too.

library(genefilter)
ps.prop = transform_sample_counts(ps, function(x) {x/sum(x)})
ps.filt = filter_taxa(ps.prop, filterfun(kOverA(1, 0.01)), TRUE)

# Then save an object with the same ASVs but with raw counts
toKeep <- taxa_names(ps.filt)
ps.rawfilt <- prune_taxa(toKeep, ps)

# Fix tax_table
tmptax <- data.frame(tax_table(ps.filt))
tmptax$Linnaeus <- paste(tmptax$Genus,tmptax$Species )
tax_table(ps.filt) <- as.matrix(tmptax)
tax_table(ps.rawfilt) <- as.matrix(tmptax)

# Plot again at Genus level
plot_bar(ps.filt, x="id_sample",fill="Genus") 

plot_bar(ps.rawfilt, x="id_sample",fill="Genus") 

# Plot at species level
plot_bar(ps.filt, x="id_sample",fill="Linnaeus" ) +
  theme(legend.title=element_blank(),legend.text = element_text(size = 8), legend.position="right")+
  guides(color=guide_legend(ncol=1, byrow=FALSE))

plot_bar(ps.rawfilt, x="id_sample",fill="Linnaeus" ) +
  theme(legend.title=element_blank(),legend.text = element_text(size = 8), legend.position="right")+
  guides(color=guide_legend(ncol=1, byrow=FALSE))

  • Did rare ASVs that were removed from this filtering step represent a large proportion of all ASVs in our samples?

4 Diversity Analyses

Now that we have cleaned and filtered our dataset, we can perform the proper diversity analyses.

4.1 Alpha diversity

Let’s first have a look at the local, individual, diversity Alpha diversity in the samples.

Remember that we want to answer three questions: Question 1: How similar are the communities in the two kefir starter cultures? Question 2: How stable are the communities over time? Question 3: Do they change in response to the environment?

  • For each of these questions, we will calculate three indices:
    • Observed is the total number of ASVs in the sample, equal to Richness
    • Simpson is the probability of encounter, and thus a measure of dominance.
    • Shannon is a measure of evenness, weighing equally richness and abundance.
#ps.rawfilt.q1 = subset_samples(ps.rawfilt, time!="END")
ps.q3 = subset_samples(ps, time!="START")
ps.filt.q3 = subset_samples(ps.filt, time!="START")
ps.rawfilt.q3 = subset_samples(ps.rawfilt, time!="START")
# Question1: How similar are the communities in the two kefir starter cultures?
pl.alpha <- plot_richness(ps.rawfilt, color = "inoculum", x="inoculum", measures=c("Observed","Shannon", "Simpson"))
pl.alpha.box <- pl.alpha + geom_boxplot()
plot(pl.alpha.box)

# Let's run a simple ANOVA on alpha diversity indices
ps.alpha <- estimate_richness(ps.rawfilt, measures=c("Observed","Shannon", "Simpson"))
ps.dataA <- cbind(sample_data(ps.rawfilt), ps.alpha)
ps.dataA$reads <- sample_sums(ps)
ps.alpha.aov <- aov(Observed ~ inoculum + reads + DNAconc, ps.dataA)
summary(ps.alpha.aov)
##             Df Sum Sq Mean Sq F value Pr(>F)  
## inoculum     1  17.50   17.50   3.963 0.0557 .
## reads        1  31.69   31.69   7.175 0.0119 *
## DNAconc      1   5.76    5.76   1.305 0.2624  
## Residuals   30 132.49    4.42                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ps.alpha.aov <- aov(Shannon ~ inoculum + reads + DNAconc, ps.dataA)
summary(ps.alpha.aov)
##             Df Sum Sq Mean Sq F value Pr(>F)  
## inoculum     1 0.4820  0.4820   6.150  0.019 *
## reads        1 0.0001  0.0001   0.002  0.969  
## DNAconc      1 0.1896  0.1896   2.418  0.130  
## Residuals   30 2.3515  0.0784                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ps.alpha.aov <- aov(Simpson ~ inoculum + reads + DNAconc, ps.dataA)
summary(ps.alpha.aov)
##             Df Sum Sq Mean Sq F value Pr(>F)  
## inoculum     1 0.0682 0.06822   5.444 0.0265 *
## reads        1 0.0019 0.00188   0.150 0.7010  
## DNAconc      1 0.0253 0.02534   2.022 0.1654  
## Residuals   30 0.3760 0.01253                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Question 2: How stable are the communities over time?
pl.alpha <- plot_richness(ps.rawfilt, color = "time", x="time", measures=c("Observed","Shannon", "Simpson"))
pl.alpha.box <- pl.alpha + geom_boxplot()
plot(pl.alpha.box)

# Let's run a simple ANOVA on alpha diversity indices
ps.alpha <- estimate_richness(ps.rawfilt, measures=c("Observed","Shannon", "Simpson"))
ps.dataA <- cbind(sample_data(ps.rawfilt), ps.alpha)
ps.dataA$reads <- sample_sums(ps)
ps.alpha.aov <- aov(Observed ~ time + reads + DNAconc, ps.dataA)
summary(ps.alpha.aov)
##             Df Sum Sq Mean Sq F value Pr(>F)  
## time         1   0.82   0.825   0.173 0.6801  
## reads        1  21.24  21.239   4.464 0.0430 *
## DNAconc      1  22.66  22.657   4.762 0.0371 *
## Residuals   30 142.72   4.757                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ps.alpha.aov <- aov(Shannon ~ time + reads + DNAconc, ps.dataA)
summary(ps.alpha.aov)
##             Df Sum Sq Mean Sq F value Pr(>F)
## time         1 0.0461 0.04607   0.469  0.499
## reads        1 0.0230 0.02300   0.234  0.632
## DNAconc      1 0.0076 0.00762   0.078  0.782
## Residuals   30 2.9465 0.09822
ps.alpha.aov <- aov(Simpson ~ time + reads + DNAconc, ps.dataA)
summary(ps.alpha.aov)
##             Df Sum Sq  Mean Sq F value Pr(>F)
## time         1 0.0046 0.004638   0.299  0.589
## reads        1 0.0002 0.000152   0.010  0.922
## DNAconc      1 0.0006 0.000613   0.039  0.844
## Residuals   30 0.4660 0.015534
# Question 3: Do they change in response to the environment (the fruit)?
pl.alpha <- plot_richness(ps.rawfilt.q3, color = "fruit", x="fruit", measures=c("Observed","Shannon", "Simpson"))
pl.alpha.box <- pl.alpha + geom_boxplot()
plot(pl.alpha.box)

# Let's run a simple ANOVA on alpha diversity indices
ps.alpha <- estimate_richness(ps.rawfilt.q3, measures=c("Observed","Shannon", "Simpson"))
ps.dataA <- cbind(sample_data(ps.rawfilt.q3), ps.alpha)
ps.dataA$reads <- sample_sums(ps.q3)
ps.alpha.aov <- aov(Observed ~ fruit + reads + DNAconc, ps.dataA)
summary(ps.alpha.aov)
##             Df Sum Sq Mean Sq F value Pr(>F)  
## fruit        1   0.21   0.215   0.048 0.8275  
## reads        1  21.33  21.327   4.806 0.0375 *
## DNAconc      1  16.96  16.958   3.822 0.0614 .
## Residuals   26 115.37   4.437                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ps.alpha.aov <- aov(Shannon ~ fruit + reads + DNAconc, ps.dataA)
summary(ps.alpha.aov)
##             Df Sum Sq Mean Sq F value Pr(>F)
## fruit        1 0.0691 0.06913   0.659  0.424
## reads        1 0.0253 0.02531   0.241  0.627
## DNAconc      1 0.0343 0.03433   0.327  0.572
## Residuals   26 2.7269 0.10488
ps.alpha.aov <- aov(Simpson ~ fruit + reads + DNAconc, ps.dataA)
summary(ps.alpha.aov)
##             Df Sum Sq  Mean Sq F value Pr(>F)
## fruit        1 0.0146 0.014581   0.885  0.356
## reads        1 0.0006 0.000572   0.035  0.854
## DNAconc      1 0.0026 0.002550   0.155  0.697
## Residuals   26 0.4286 0.016484
  • Do you see any differences between Philipp’s and Vincent’s kefir cultures? Between start and end points? Between figs and raisins?

We can all also combine all the comparisons into one single variable called “treatment” with the following code:

pl.alpha <- plot_richness(ps.rawfilt, color = "treatment", x="treatment", measures=c("Observed","Shannon", "Simpson"))
pl.alpha.box <- pl.alpha + geom_boxplot()
plot(pl.alpha.box)

# Let's run a simple ANOVA on richness (the other metrics are non-linear)
ps.alpha <- estimate_richness(ps.rawfilt, measures=c("Observed","Shannon", "Simpson"))
ps.dataA <- cbind(sample_data(ps.rawfilt), ps.alpha)
ps.dataA$reads <- sample_sums(ps)
ps.alpha.aov <- aov(Observed ~ treatment + reads + DNAconc, ps.dataA)
summary(ps.alpha.aov)
##             Df Sum Sq Mean Sq F value  Pr(>F)   
## treatment    4  24.28    6.07   1.395 0.26221   
## reads        1  36.11   36.11   8.301 0.00767 **
## DNAconc      1   9.58    9.58   2.201 0.14946   
## Residuals   27 117.47    4.35                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

4.2 Composition - taxon-level analysis

Let’s plot again the community compositions, but now aggregating samples by their treatment. You can also aggregate them according to other variables in the sample_data!

# Now let's do the same thing at the genus level, displaying only the 20 most abundant genera
# select top 10 most prevalent Families over the whole dataset
ps.gen <- tax_glom(ps.filt, taxrank="Genus")
Top10_gen <- names(sort(taxa_sums(ps.gen), TRUE)[1:10])
ps.gen <- prune_taxa(Top10_gen, ps.gen)
# Melt them for plotting
ps.gen.df <- psmelt(ps.gen)

# Plot each treatment group next to each other
ps.gen.df %>%
  ggplot(aes_string(x = "Sample", y = "Abundance", fill = "Genus")) +
  geom_bar(stat = "identity", position = "stack") + 
  theme_bw() + 
  facet_grid(~ inoculum, scales = "free_x", space = "free_x") +
  theme(legend.position = "bottom", axis.text.x =  element_text(angle = 80, vjust = 0.5)) +
  ggtitle("Relative abundance at the Genus level between Philipp's and Vincent's kefir cultures")

ps.gen.df %>%
  ggplot(aes_string(x = "Sample", y = "Abundance", fill = "Genus")) +
  geom_bar(stat = "identity", position = "stack") + 
  theme_bw() + 
  facet_grid(~ time, scales = "free_x", space = "free_x") +
  theme(legend.position = "bottom", axis.text.x =  element_text(angle = 80, vjust = 0.5)) +
  ggtitle("Relative abundance at the Genus level between start and end cultures")

ps.gen.q3 <- tax_glom(ps.filt.q3, taxrank="Genus")
Top10_gen <- names(sort(taxa_sums(ps.gen.q3), TRUE)[1:10])
ps.gen.q3 <- prune_taxa(Top10_gen, ps.gen.q3)
# Melt them for plotting
ps.gen.q3.df <- psmelt(ps.gen.q3)

ps.gen.q3.df %>%
  ggplot(aes_string(x = "Sample", y = "Abundance", fill = "Genus")) +
  geom_bar(stat = "identity", position = "stack") + 
  theme_bw() + 
  facet_grid(~ fruit, scales = "free_x", space = "free_x") +
  theme(legend.position = "bottom", axis.text.x =  element_text(angle = 80, vjust = 0.5)) +
  ggtitle("Relative abundance at the Genus level between figs and raisins")

  • After inspecting the plots…
    • Do you see any pattern between comparison groups?
    • How much influence do you think the starting culture had on the rest of the samples?

As previously, we can all also combine all the comparisons into one single variable called “treatment” with the following code:

# Plot the proportions to better compare the communities
pl.com.pro <- plot_bar(ps.filt, x="id_sample",fill="Linnaeus") +        
  facet_wrap(~treatment, scales="free_x", nrow=1) +
  theme(legend.title=element_blank(),legend.text = element_text(size = 8), legend.position="bottom")+
  guides(color=guide_legend(ncol=6, byrow=FALSE))
pl.com.pro

4.3 Beta diversity

Now we are going to quantify how (dis)similar our communities are. We will use two distance metrics, and visualize them with an ordination.

# Ordinate using Principal Coordinate Analysis & NMDS
ps.nmds.bray <- ordinate(ps.rawfilt, "NMDS", "bray",trymax=50)
## Square root transformation
## Wisconsin double standardization
## Run 0 stress 0.1396999 
## Run 1 stress 0.1421336 
## Run 2 stress 0.1418913 
## Run 3 stress 0.154612 
## Run 4 stress 0.1391781 
## ... New best solution
## ... Procrustes: rmse 0.1009367  max resid 0.2512856 
## Run 5 stress 0.1498729 
## Run 6 stress 0.1569684 
## Run 7 stress 0.1414483 
## Run 8 stress 0.1410611 
## Run 9 stress 0.1500274 
## Run 10 stress 0.1494211 
## Run 11 stress 0.1441332 
## Run 12 stress 0.1421798 
## Run 13 stress 0.151331 
## Run 14 stress 0.1579865 
## Run 15 stress 0.1507755 
## Run 16 stress 0.1410608 
## Run 17 stress 0.1382355 
## ... New best solution
## ... Procrustes: rmse 0.09276318  max resid 0.2476177 
## Run 18 stress 0.1411982 
## Run 19 stress 0.1422367 
## Run 20 stress 0.1436699 
## Run 21 stress 0.1421926 
## Run 22 stress 0.1412372 
## Run 23 stress 0.1410612 
## Run 24 stress 0.1391757 
## Run 25 stress 0.1522416 
## Run 26 stress 0.1395976 
## Run 27 stress 0.1410607 
## Run 28 stress 0.1411982 
## Run 29 stress 0.1387484 
## Run 30 stress 0.1456364 
## Run 31 stress 0.1395978 
## Run 32 stress 0.1397051 
## Run 33 stress 0.1412372 
## Run 34 stress 0.1498728 
## Run 35 stress 0.1533121 
## Run 36 stress 0.1397057 
## Run 37 stress 0.1434789 
## Run 38 stress 0.143791 
## Run 39 stress 0.1411981 
## Run 40 stress 0.1447864 
## Run 41 stress 0.1397004 
## Run 42 stress 0.1411987 
## Run 43 stress 0.1458232 
## Run 44 stress 0.1534609 
## Run 45 stress 0.1437985 
## Run 46 stress 0.1459104 
## Run 47 stress 0.1391795 
## Run 48 stress 0.1441332 
## Run 49 stress 0.1395977 
## Run 50 stress 0.1503134 
## *** Best solution was not repeated -- monoMDS stopping criteria:
##     49: stress ratio > sratmax
##      1: scale factor of the gradient < sfgrmin
# Check stress level and dimensions used
ps.nmds.bray
## 
## Call:
## metaMDS(comm = veganifyOTU(physeq), distance = distance, trymax = 50) 
## 
## global Multidimensional Scaling using monoMDS
## 
## Data:     wisconsin(sqrt(veganifyOTU(physeq))) 
## Distance: bray 
## 
## Dimensions: 2 
## Stress:     0.1382355 
## Stress type 1, weak ties
## Best solution was not repeated after 50 tries
## The best solution was from try 17 (random start)
## Scaling: centring, PC rotation, halfchange scaling 
## Species: expanded scores based on 'wisconsin(sqrt(veganifyOTU(physeq)))'
# Plot NMDS
pl.nmds.bray <- plot_ordination(ps.rawfilt, ps.nmds.bray,color="treatment",title="NDMS of Bray-Curtis dissimilarities",label="id_sample" ) +
  stat_ellipse(aes(group = inoculum))
pl.nmds.bray

And let’s test if the differences in the composition of communities from the same group are smaller than between groups using a Permutational Multivariate ANOVA (PerMANOVA)

# Calculate Bray-Curtis distance
ps.bray <- phyloseq::distance(ps.rawfilt,"bray")

# Test with ADONIS
ps.adonis.int <- adonis2(ps.bray ~ reads + (inoculum*fruit), data = ps.dataA, perm =9999)
ps.adonis.int
  • What seems to be the most influential factor explaining the differences between bacterial kefir communities?